-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change TerminatePhase return value #6335
base: develop
Are you sure you want to change the base?
Conversation
Now TerminatePhase returns Parallel::AlgorithmExecution::Halt. Also added a new action PausePhase which returns Parallel::AlgorithmExecution::Pause if that is needed. This changes how the algorithm ends at the end of a phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a full review, but I think you need to respond to @kidder's concerns from elsewhere, reproduced below:
- Retry tells the distributed object to wait for more data before resuming the algorithm. The algorithm is still considered to be running on the distributed object. A subsequent call to receive_data will attempt to run the same action after putting data into the Inbox
- Halt stops the algorithm. Nothing can restart the algorithm of a halted distributed object until the new phase is determined. This is used to execute a triggered phase change such as checkpointing or AMR
- Pause stops the algorithm, but lets either something calling receive_data(instance, data, true) or perform_algorithm(true) (where the trues override defaulted false arguments) on the paused distributed object to restart the algorithm.
- The member variable terminate_ is true if the algorithm is stopped. This happens for either Halt or Pause being called
- The member variable halt_algorithm_until_next_phase_ is set true when Halt is called
- There are currently two instances of perform_algorithm(true), one is in an example Geoffrey uses in his tutorial, the other is in the linear solver action BuildMatrix
- So calling perform_algorithm() on a Paused or Halted object should not have done anything. So I am confused by the [diagnosis of a restart being caused by a reduction]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review of code changes. Still need to be convinced this isn't just hiding some other bug.
step_actions>>; | ||
// This test doesn't operate exactly like how SelfStart would normally work in | ||
// an executable. Instead it jumps around quite a lot. Therefore to avoid any | ||
// issues that TerminatePhase would cause, we just replace it with PausePhase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remove it. No reason to replace it with a no-op.
@@ -209,10 +209,10 @@ SPECTRE_TEST_CASE( | |||
std::make_unique<::TimeSteppers::DormandPrince5>())); | |||
|
|||
// Run the initializations | |||
for (size_t i = 0; i < 6; ++i) { | |||
for (size_t i = 0; i < 5; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a ton of unexplained changes like this in this PR. What is going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 5 actions in the initialization phase for this component. Previously the last one returned Pause
, so when it called ActionTesting::next_action
one extra time, nothing bad happened. But now that the last action returns Halt
, calling it an extra time causes an error. So it was a bug previously to call it 6 times. Same for the rest of the changes like this
Now TerminatePhase returns
Parallel::AlgorithmExecution::Halt
. Also added a new action PausePhase which returnsParallel::AlgorithmExecution::Pause
if that is needed.This changes how the algorithm ends at the end of a phase.
Proposed changes
This is a fix for a bug in nodegroup parallization, but also a good thing to do. Once we get to the end of an PDAL (and we aren't looping), we shouldn't be allowed to restart as many phases are only intended to run once
Upgrade instructions
Code review checklist
make doc
to generate the documentation locally intoBUILD_DIR/docs/html
.Then open
index.html
.code review guide.
bugfix
ornew feature
if appropriate.Further comments